1  Introduction


The title of this article is, of course, deliberately provocative,
in part to capture the reader's attention, but in
part also to make a point. A common assumption of
researchers  working  in  stereo  vision  is  that  the  goal  of 
stereo  is  to  compute  explicit  3D  information  about  a 
scene,  in  order  to  support  activities  such  as  navigation, 
hand-eye  coordination  and  object  recognition.  While 
there  are  applications  in  which  such  information  can  be 
accurately computed, these domains require very accurate
camera calibration information. We suggest that
in  many  applications,  it  may  be  difficult  to  attain  and 
maintain  such  accurate  information,  and  hence  we  sug¬ 
gest  that  it  may  be  worthwhile  to  reconsider  what  is  re¬ 
quired  of  a  stereo  algorithm,  in  light  of  the  needs  of  the 
task  that  uses  stereo's  output.  In  particular,  we  examine 
the  role  of  stereo  in  object  recognition,  arguing  that  it 
may  be  more  effective  as  a  means  of  separating  objects 
from  background,  than  as  a  provider  of  3D  information 
to  match  with  object  models.  To  support  this  argument, 
we  provide  a  demonstration  of  a  stereo  algorithm  that 
separates  figure  from  ground  through  attentive  fixation 
on  key  features,  without  explicitly  computing  actual  3D 
information. 

2  Some  Stereo  Puzzles 

It  has  been  common  in  recent  years  within  the  computer 
vision  community  to  consider  the  stereo  vision  problem 
as consisting of three key steps [23], [27]:

•  Identify a particular point in one image (say the
left).

•  Find  the  point  in  the  other  (say  right)  image  that 
is  a  projection  of  the  same  scene  point  as  observed 
in  the  first  image. 

•  Measure  the  disparity  (or  difference  in  projection) 
between  the  left  and  right  image  points.  Use  knowl¬ 
edge  of  the  relative  orientation  of  the  two  camera 
systems,  plus  the  disparity,  to  determine  the  actual 
distance  to  the  imaged  scene  point. 

These  steps  are  repeated  for  a  large  number  of  points, 
leading  to  a  3D  reconstruction  of  the  scene,  at  those 
points. 
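The third step can be made concrete for the simplest geometry. The sketch below assumes parallel optic axes, a known focal length and a known baseline; the function name and the sample numbers are illustrative and are not taken from any particular system. Under those assumptions, distance follows from disparity by similar triangles.

```python
def depth_from_disparity(x_left, x_right, focal_length, baseline):
    """Distance to a matched point for two cameras with parallel optic axes.

    x_left, x_right: horizontal image positions of the matched point,
    measured from each principal point, in the same units as focal_length.
    baseline: separation of the two projection centers.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("a point in front of parallel cameras has positive disparity")
    return focal_length * baseline / disparity


# Illustrative numbers: 15 mm focal length, 100 mm baseline, 1.5 mm disparity.
example_depth = depth_from_disparity(0.75, -0.75, 15.0, 100.0)
print(example_depth)  # distance in mm
```

Note that the result depends directly on the focal length, the baseline and the location of both principal points, which is why calibration matters for this step.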

There  are  many  variations  on  this  theme,  including 
whether  to  use  distinctive  features  such  as  edges  or  cor¬ 
ners  as  the  points  to  match,  or  to  simply  use  local 
patches  of  brightness  values,  what  constraints  to  apply  to 
the  search  for  corresponding  matches  (e.g.  epipolar  lines, 
similar  contrast,  similar  orientation,  etc.),  and  whether 
to  restrict  the  relative  orientation  of  the  cameras  (e.g. 
to  parallel  optic  axes).  Nonetheless,  it  has  been  com¬ 
monly  assumed  for  some  time  that  the  hard  part  of  the 
problem  is  solving  for  the  correspondence  between  left 
and  right  image  features.  Once  one  knows  which  points 
match,  it  has  been  assumed  that  measuring  the  dispar¬ 
ity  is  trivial,  and  that  solving  for  the  distance  simply 
requires  using  the  geometry  of  the  cameras  to  invert  a 
simple trigonometric projection.

This  sounds  fine,  but  let's  consider  some  puzzles  about 
this approach. The first puzzle is a perceptual one, illustrated
in Figure 1.

Figure 1: Cornsweet illusion in depth.

This illusion is a depth variant on the standard Cornsweet
illusion in brightness, and is due to Anstis et al. [2] (see
also [37]). It consists of a physical
object with two coplanar regions separated by a sharp
discontinuity, where the regions immediately to the sides
of the discontinuity are smoothly curved. These surfaces
are  textured  with  random  dot  paint,  to  make  them  visi¬ 
ble  to  the  viewer.  Subjects  are  then  asked  to  determine 
whether the two planar regions are coplanar, or separated
in depth, and if it is the latter, which surface is
closer  and  by  how  much.  Although  physically  the  two 
surfaces  are  in  fact  coplanar,  subjects  consistently  see 
one  of  the  two  surfaces  as  closer  (the  left  side  in  the  case 
of Figure 1). The reported error is .ficni and is consistent
for three different view distances: 72, 145 and 290cm.

This is clearly surprising if one believes that the above
description  of  the  stereo  process  holds  for  biological  as 
well  as  machine  solutions.  In  particular,  if  the  human 
system maintains a representation of reconstructed distance,
and if that representation is accessible to queries,
then  it  is  difficult  to  see  how  human  observers  could  con¬ 
sistently  make  such  a  mistake. 

Additional stereo puzzles are provided in [40], which
the authors use to argue that depth is not computed
directly in humans, but is reconstructed from non-zero
second differences in depth. As a consequence, they
demonstrate that human stereo vision is blind to constant
gradients of depth. Similar observations on the
role of disparity gradients in reconstructing depth are
given by [44].

It need not be the case that machine stereo systems
make  the  same  "mistakes"  as  human  observers,  but  the 
existence  of  such  an  illusion  for  humans  raises  an  in¬ 
teresting  question  about  the  basic  assumptions  of  ap¬ 
proaches  that  reconstruct  distance. 

Consider a second puzzle about the approach of
matching  features,  then  using  trigonometry  to  convert 
into  depth.  As  noted,  for  years  stereo  researchers  have 
assumed  that  the  correspondence  problem  was  the  hard 
part  of  the  task.  Once  correct  correspondences  were 
found,  the  reconstruction  was  a  simple  matter  of  geom¬ 
etry.  This  is  true  in  principle,  but  it  relies  on  finding 


the intrinsic parameters of the camera systems and the
extrinsic parameters relating the orientation of the two
cameras.  While  solutions  exist  for  finding  these  param¬ 
eters  (e.g.  [11]),  such  solutions  appear  to  be  numerically 
unstable  [4o,  4;{].  If  one  does  not  perform  very  care¬ 
ful  calibration  of  the  camera  platform,  the  result  will  be 
very  noisy  reconstructed  distances. 

Of  course,  there  are  circumstances  in  which  careful 
calibration can be performed, and in these cases, extremely
accurate reconstructions are possible. A good
example  of  this  is  automated  cartographic  reconstruc¬ 
tion  from  satellite  imagery,  where  commercial  systems 
can  provide  maps  with  accuracy  on  the  order  of  a  few 
meters, from satellite photography [19]. On the other
hand,  if  the  cameras  are  mounted  on  a  mobile  robot  that 
is  perturbed  as  it  moves  through  the  environment,  then 
it  may  be  more  difficult  to  attain  and  maintain  careful 
calibration.  Thus,  we  see  that  there  are  some  sugges¬ 
tions  that  human  observers  do  not  reconstruct  depth, 
and some suggestions that one needs very careful calibration
(which is often hard to guarantee) in order to do
this. We will explore the calibration sensitivity issue in
section 3.

Given this puzzle, it is worth stepping back to ask
what  one  needs  from  the  output  of  a  stereo  algorithm. 
Aside  from  specialized  tasks  such  as  cartography,  the  two 
standard  general  application  areas  are  navigation  and 
recognition.  Interestingly,  Faugeras  [8]  (see  also  [39])  has 
recently  argued  that  one  can  construct  and  maintain  a 
representation  of  the  scene  structure  around  a  moving 
robot,  without  a  need  for  careful  calibration.  Moreover, 
the  solution  involves  using  relative  coordinate  systems 
to  represent  the  scene,  so  that  there  is  no  metrical  re¬ 
construction  of  the  scene. 

What  about  object  recognition?  We  have  found  it 
convenient  to  separate  the  recognition  problem  into  three 
pieces  [11]: 

•  Selection:  Extract  subsets  of  the  data  features 
likely to have come from a single object.

•  Indexing:  Look  up  those  object  models  that  could 
have  given  rise  to  one  such  selected  subset. 

•  Correspondence:  Determine  if  there  is  a  way  of 
matching  model  features  to  data  features  that  is 
consistent  with  a  legal  transformation  of  the  model 
into  the  data. 

We  have  argued  [11]  that  for  many  approaches  to 
recognition,  the  first  stage  is  the  crucial  one.  In  many 
cases,  it  reduces  the  expected  complexity  of  recognition 
from  exponential  to  low-order  polynomial,  and  in  many 
cases, it is necessary to keep the false positive rates under
control.  If  we  accept  that  the  hard  part  of  recognition  is 
selection,  rather  than  correspondence,  then  this  has  an 
interesting  implication  for  stereo.  If  stereo  were  mainly 
oriented  towards  solving  the  correspondence  problem,  it 
is  natural  to  expect  that  it  needs  to  deliver  accurate  3D 
data  that  can  be  compared  to  3D  models.  But  if  stereo  is 
mainly  intended  to  help  with  the  selection  problem,  then 
one  no  longer  needs  to  extract  exact  3D  reconstructions, 
one  simply  needs  stereo  to  identify  data  feature  subsets 
that  are  roughly  in  the  same  depth  range,  or  equivalently 


do not have large variations in disparity. We will examine
a modified stereo algorithm in section 4 that takes
advantage of this observation.

If one accepts that stereo is primarily for segmentation,
not for 3D reconstruction, this leads to the further
question of whether recognition of 3D objects can be
done without explicit 3D input data. A number of recent
techniques have shown interesting possibilities along
these lines: for example, the recent development of the
linear combinations method [42] suggests that one could
use stored 2D images of a model to generate an hypothesized
2D image which can then be compared to the observed
image. Again, one does not need to extract exact
3D data. It is also intriguing along these lines to observe
that some physiological data [34, 35] may also support
the idea of the human system solving 3D recognition
from purely 2D views. Of course, it is possible to solve
the recognition problem by matching reconstructed 3D
stereo data against 3D models [27].

To summarize, we consider three main points:

•  the human stereo system may not directly compute
3D depth, suggesting that humans may not need
explicit depth;

•  small inaccuracies in measuring camera parameters
can lead to large errors in computed depth, suggesting
that we may not be able to compute explicit
depth accurately;

•  the critical part of object recognition is figure/ground
separation, which may not require explicit
depth information.

We  will  use  this  to  argue  that  stereo  can  contribute 
to the efficient solution of the object recognition problem,
without the need for accurate calibration and without
the need for explicit depth computation. In this
case,  the  importance  of  eye  movements  or  related  con¬ 
trol  strategies  is  increased,  causing  us  to  reexamine  the 
structure  of  stereo  algorithms.  Similar  questions  have 
been raised by systems that use actively controlled stereo eye-head
systems to acquire depth information (for example,
[1, 5, 6, 7, 9, 29, 30, 38, 33]).

3  Why  Reconstruction  is  Too  Sensitive 

While our first point is based primarily on earlier psychophysical
observations, the second point bears closer
examination. Let's look in more detail at the problem of
computing distance from stereo disparity. Suppose our
two cameras have points of projection located at $\mathbf{b}_\ell$ and
$\mathbf{b}_r$, measured in some world coordinate system. Assume
that the optic axes are $\mathbf{z}_\ell$ and $\mathbf{z}_r$ and that both cameras
have the same focal length $f$ (though we could easily
relax this to have two different focal lengths).
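In this case, we can represent the left image plane by

$$\{\mathbf{v} \mid \langle \mathbf{v}, \mathbf{z}_\ell \rangle = d_\ell\}$$

where $\langle \cdot , \cdot \rangle$ represents an inner (or dot) product. The
principal point (or image center) is given by

$$\mathbf{c}_\ell = \mathbf{b}_\ell + f\,\mathbf{z}_\ell$$

where we have chosen to place the image plane in front of
the projection point, to avoid the inversion of the coordinate
axes of the image. Since we know that this point lies
on the image plane, we can deduce the constant offset,
so that the left image plane is given by

$$\{\mathbf{v} \mid \langle \mathbf{v} - \mathbf{b}_\ell, \mathbf{z}_\ell \rangle = f\}.$$

A similar representation holds for the right image plane.

Now an arbitrary scene point $\mathbf{p}$ maps, under perspective
projection, to a point $\mathbf{p}_\ell$ on the left image plane,
given by

$$\mathbf{p}_\ell = \mathbf{b}_\ell + f\,\frac{\mathbf{p} - \mathbf{b}_\ell}{\langle \mathbf{p} - \mathbf{b}_\ell, \mathbf{z}_\ell \rangle}$$

and for convenience we write this as

$$\mathbf{p}_\ell = \mathbf{c}_\ell + \mathbf{d}_\ell$$

where $\langle \mathbf{d}_\ell, \mathbf{z}_\ell \rangle = 0$. Here $\mathbf{d}_\ell$ is an offset vector in the
image plane from the principal point:

$$\mathbf{d}_\ell = f \left[ \frac{\mathbf{z}_\ell \times \left( (\mathbf{p} - \mathbf{b}_\ell) \times \mathbf{z}_\ell \right)}{\langle \mathbf{p} - \mathbf{b}_\ell, \mathbf{z}_\ell \rangle} \right].$$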
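The projection geometry above can be checked numerically. The following is a minimal sketch in plain Python, assuming vectors are 3-element lists and the optic axis is a unit vector; the helper names `dot` and `project` and the sample camera are hypothetical. It computes the image point, the principal point and the offset vector, so that one can confirm the offset is perpendicular to the optic axis.

```python
def dot(u, v):
    """Inner product of two 3-vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project(p, b, z, f):
    """Perspective projection of scene point p for a camera with
    projection point b, unit optic axis z and focal length f.

    Returns (image_point, principal_point, offset), where
    image_point = principal_point + offset.
    """
    ray = [pi - bi for pi, bi in zip(p, b)]
    depth_along_axis = dot(ray, z)
    image_point = [bi + f * ri / depth_along_axis for bi, ri in zip(b, ray)]
    principal_point = [bi + f * zi for bi, zi in zip(b, z)]
    offset = [ip - ci for ip, ci in zip(image_point, principal_point)]
    return image_point, principal_point, offset


# Hypothetical left camera, 5 units left of the origin, looking along +z.
b_left, z_left, f = [-5.0, 0.0, 0.0], [0.0, 0.0, 1.0], 1.5
image_point, principal_point, offset = project([10.0, 20.0, 100.0], b_left, z_left, f)
print(offset, dot(offset, z_left))
```

The second printed value is the component of the offset along the optic axis, which should vanish up to rounding.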


Note that we haven't specified the world coordinate
system yet, and we can now take advantage of that freedom.
In particular, we choose the origin of the world
coordinate system to be centered between the projection
points, so that $\mathbf{b}_\ell = -\mathbf{b}_r = \mathbf{b}$.

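By subtracting $\mathbf{d}_r$ from $\mathbf{d}_\ell$, we get the following
relationship

$$\langle \mathbf{p} - \mathbf{b}_\ell, \mathbf{z}_\ell \rangle\, \mathbf{d}_\ell - \langle \mathbf{p} - \mathbf{b}_r, \mathbf{z}_r \rangle\, \mathbf{d}_r = f \left[ -\mathbf{b}_\ell + \mathbf{b}_r - \langle \mathbf{p} - \mathbf{b}_\ell, \mathbf{z}_\ell \rangle\, \mathbf{z}_\ell + \langle \mathbf{p} - \mathbf{b}_r, \mathbf{z}_r \rangle\, \mathbf{z}_r \right] \tag{1}$$

For the special case of the origin centered between the
projection points, this becomes

$$\langle \mathbf{p} - \mathbf{b}, \mathbf{z}_\ell \rangle\, \mathbf{d}_\ell - \langle \mathbf{p} + \mathbf{b}, \mathbf{z}_r \rangle\, \mathbf{d}_r = f \left[ -2\mathbf{b} - \langle \mathbf{p} - \mathbf{b}, \mathbf{z}_\ell \rangle\, \mathbf{z}_\ell + \langle \mathbf{p} + \mathbf{b}, \mathbf{z}_r \rangle\, \mathbf{z}_r \right]. \tag{2}$$

We can isolate components of $\mathbf{p}$ with respect to each
of the two optic axes, by taking the dot product of both
sides of equation 1 or 2 with respect to these unit vectors.
This gives us two linear equations (assuming that $\mathbf{z}_\ell \neq \mathbf{z}_r$),
which we can solve to find these components of $\mathbf{p}$.
Adding them together yields:

$$\langle \mathbf{p}, \mathbf{z}_\ell + \mathbf{z}_r \rangle = \frac{(f^2 + \alpha\beta)\, \langle \mathbf{b}, \mathbf{z}_\ell - \mathbf{z}_r \rangle + 2f\, \langle \mathbf{b}, \beta \mathbf{z}_\ell - \alpha \mathbf{z}_r \rangle}{\alpha\beta - f^2}$$

where

$$\alpha = \langle \mathbf{d}_r + f\,\mathbf{z}_r, \mathbf{z}_\ell \rangle$$

$$\beta = \langle \mathbf{d}_\ell + f\,\mathbf{z}_\ell, \mathbf{z}_r \rangle .$$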

To explore how this computation of depth from stereo
measurements depends on the accuracy of the calibrated
parameters and the disparity measurements, we consider
the symmetric case of:

$$\mathbf{z}_\ell = \cos\gamma\, \mathbf{z} + \sin\gamma\, \mathbf{x}$$

$$\mathbf{z}_r = \cos\gamma\, \mathbf{z} - \sin\gamma\, \mathbf{x}$$

$$\mathbf{b} = -b\, \mathbf{x}$$

where $\mathbf{x}$ is chosen as the direction of the vector connecting
the two centers of projection, and where the two cameras
make a symmetric (though opposite signed) gaze
angle $\gamma$ with the $\mathbf{z}$ axis, and where the offset of each
camera from the origin is the same. In this case, substitution
and manipulation leads to

$$\langle \mathbf{p}, \mathbf{z} \rangle \cos\gamma = \frac{2b\, (f\cos^2\gamma + d_r \sin\gamma)\, (f\cos^2\gamma - d_\ell \sin\gamma)}{2\sin\gamma\, (f^2\cos^2\gamma + d_r d_\ell) - f\, (\cos^2\gamma - \sin^2\gamma)\, (d_r - d_\ell)}$$

where we have let

$$d_r = \langle \mathbf{d}_r, \mathbf{x} \rangle$$

$$d_\ell = \langle \mathbf{d}_\ell, \mathbf{x} \rangle .$$

Note that in the special case of parallel optic axes ($\gamma = 0$),
this reduces to

$$\langle \mathbf{p}, \mathbf{z} \rangle = \frac{2bf}{d_\ell - d_r}$$

which is exactly what one would expect, since $d_\ell - d_r$ is
simply the disparity at this point.
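One way to gain confidence in a closed-form depth expression of this kind is to test it on synthetic data. The sketch below assumes the symmetric configuration described above (cameras at -b and +b along x, verged by opposite angles gamma) and takes the scalar offsets to be the x components of the image offset vectors; the function names and the test point are hypothetical. It projects a known scene point into both cameras and then recovers its depth from the two offsets.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def offset_x(p, cam, axis, f):
    """x component of the image offset d = f * ray / <ray, axis> - f * axis."""
    ray = [pi - ci for pi, ci in zip(p, cam)]
    return f * ray[0] / dot(ray, axis) - f * axis[0]


def depth_symmetric(d_l, d_r, f, b, gamma):
    """Depth <p, z> for cameras at -b*x and +b*x verged by +gamma and -gamma."""
    c, s = math.cos(gamma), math.sin(gamma)
    numerator = 2 * b * (f * c * c + d_r * s) * (f * c * c - d_l * s)
    denominator = (2 * s * (f * f * c * c + d_r * d_l)
                   - f * (c * c - s * s) * (d_r - d_l))
    return numerator / (denominator * c)


# Synthetic check: project a known point, then recover its depth.
f, b, gamma = 1.0, 0.05, 0.05
z_l = [math.sin(gamma), 0.0, math.cos(gamma)]    # left axis turns toward +x
z_r = [-math.sin(gamma), 0.0, math.cos(gamma)]   # right axis turns toward -x
p = [0.1, 0.2, 1.3]
d_l = offset_x(p, [-b, 0.0, 0.0], z_l, f)
d_r = offset_x(p, [b, 0.0, 0.0], z_r, f)
recovered = depth_symmetric(d_l, d_r, f, b, gamma)
print(recovered, p[2])
```

With exact parameters the recovered depth agrees with the true depth; perturbing `gamma`, `f` or the offsets before calling `depth_symmetric` gives a direct feel for the sensitivity discussed next.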

For convenience, call $Z = \langle \mathbf{p}, \mathbf{z} \rangle$. This equation tells
us how to compute the depth $Z$, given measurements
for the camera parameters $f$, $b$, $\gamma$ and the two principal
points $\mathbf{c}_\ell$, $\mathbf{c}_r$ as well as the individual measurements of
displacement $\mathbf{d}_\ell$, $\mathbf{d}_r$ (or equivalently $d_\ell$ and $d_r$).

The question we want to consider is how accurately
do we need to know these parameters? There has been
some previous analysis of stereo error in the literature,
primarily focused on the effects of pixel quantization
[43, 2t<, 25], although some analysis of the effects of camera
parameters has also been done [45, 41]. Here we are
primarily interested in the effects of the camera parameters.

For sake of simplicity, we will assume that $\gamma$ is small.
For example, if the cameras are fixated at a target 1
meter removed, with an interocular separation of 10cm,
then $\gamma \approx .05$ radians, or if the fixation target is .5 meters
off, then $\gamma \approx .1$ radians. In the second case, the small
angle approximation will lead to an error in $\cos\gamma$ of at
most .005 and an error in $\sin\gamma$ of at most .0002. Using
the small angle approximation leads to

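$$Z \approx 2b\, \frac{f^2 + f\gamma\,(d_r - d_\ell)}{2\gamma\,(f^2 + d_r d_\ell) - f\,(d_r - d_\ell)} \tag{7}$$

where the second-order term $\gamma^2 d_r d_\ell$ in the numerator has been dropped.
If we rewrite this, isolating depth in terms of interocular
units ($2b$), and image offsets in terms of focal length (or
equivalently in terms of angular arc), we get:

$$\frac{Z}{2b} \approx \frac{1 + \gamma \left( \frac{d_r}{f} - \frac{d_\ell}{f} \right)}{2\gamma \left( 1 + \frac{d_r d_\ell}{f^2} \right) - \left( \frac{d_r}{f} - \frac{d_\ell}{f} \right)}$$

In some cases it is more convenient to consider this expression
in terms of relative units, that is representing
depth in terms of interocular spacing, by using

$$Z' = \frac{Z}{2b}$$

and to use disparities as angular arcs by using

$$d'_r = \frac{d_r}{f}, \qquad d'_\ell = \frac{d_\ell}{f} .$$

In this case, we have

$$Z' \approx \frac{1 + \gamma\,(d'_r - d'_\ell)}{2\gamma - (d'_r - d'_\ell) + 2\gamma\, d'_r d'_\ell}$$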
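The numbers quoted for the small-angle regime are easy to reproduce. This sketch assumes a fixation target straight ahead of the midpoint between the cameras, so that the gaze angle is the arctangent of half the interocular separation over the fixation distance; the function names are illustrative. It also evaluates the small-angle depth in relative units at the fixation point, where both offsets are zero.

```python
import math


def gaze_angle(fixation_distance, interocular):
    """Symmetric gaze angle (radians) when both cameras fixate a point
    straight ahead at the given distance."""
    return math.atan2(interocular / 2.0, fixation_distance)


def relative_depth(d_r, d_l, gamma):
    """Small-angle depth in interocular units; d_r and d_l are image
    offsets divided by the focal length (angular arcs)."""
    numerator = 1.0 + gamma * (d_r - d_l)
    denominator = 2.0 * gamma - (d_r - d_l) + 2.0 * gamma * d_r * d_l
    return numerator / denominator


for distance_m in (1.0, 0.5):
    g = gaze_angle(distance_m, 0.10)
    cos_error = 1.0 - math.cos(g)   # error from replacing cos(g) by 1
    sin_error = g - math.sin(g)     # error from replacing sin(g) by g
    print(distance_m, round(g, 4), cos_error, sin_error, relative_depth(0.0, 0.0, g))
```

At the fixation point the relative depth is simply 1/(2*gamma), which for a gaze angle of .05 radians is 10 interocular units, i.e. 1 meter for a 10cm separation.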

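By taking partial derivatives of this equation with respect
to each of the parameters of interest (which we
treat as independent of one another), we arrive at the
following expressions for the relative change in computed
depth as a function of the relative error in measuring the
parameters:

$$\frac{\Delta Z}{Z} = \frac{\Delta b}{b}$$

$$\frac{\Delta Z}{Z} = \frac{\Delta d_r}{f}\; \frac{\gamma + Z' - 2\gamma\, d'_\ell Z'}{1 + \gamma\,(d'_r - d'_\ell)}$$

$$\frac{\Delta Z}{Z} = -\frac{\Delta d_\ell}{f}\; \frac{\gamma + Z' + 2\gamma\, d'_r Z'}{1 + \gamma\,(d'_r - d'_\ell)}$$

$$\frac{\Delta Z}{Z} = -\frac{\Delta f}{f}\; \frac{(\gamma + Z')(d'_r - d'_\ell) - 4\gamma\, d'_r d'_\ell Z'}{1 + \gamma\,(d'_r - d'_\ell)}$$

$$\frac{\Delta Z}{Z} = \Delta\gamma\; \frac{d'_r - d'_\ell - 2Z'\,(1 + d'_r d'_\ell)}{1 + \gamma\,(d'_r - d'_\ell)}$$

If we use standard viewing geometries (i.e. focal length
much larger than individual pixel size, $\gamma$ small), we can
approximate these expressions as follows: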


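$$\frac{\Delta Z}{Z} = \frac{\Delta b}{b} \tag{8}$$

$$\frac{\Delta Z}{Z} \approx \frac{\Delta d_r}{f}\, Z' \tag{9}$$

$$\frac{\Delta Z}{Z} \approx -\frac{\Delta d_\ell}{f}\, Z' \tag{10}$$

$$\frac{\Delta Z}{Z} \approx -\frac{\Delta f}{f}\, Z'\, (d'_r - d'_\ell) \tag{11}$$

$$\frac{\Delta Z}{Z} \approx -2 Z'\, \Delta\gamma \tag{12}$$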
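Sensitivity expressions of this kind can be sanity-checked by comparing closed-form relative derivatives against central finite differences. The sketch below assumes the small-angle form of depth in relative units, Z' = (1 + g(dr - dl)) / (2g - (dr - dl) + 2 g dr dl), with offsets expressed as angular arcs; the function names, step size and sample operating point are illustrative choices.

```python
def relative_depth(d_r, d_l, gamma):
    n = 1.0 + gamma * (d_r - d_l)
    d = 2.0 * gamma - (d_r - d_l) + 2.0 * gamma * d_r * d_l
    return n / d


def analytic_sensitivities(d_r, d_l, gamma):
    """Closed-form (1/Z') * dZ'/dx for x in (d_r, d_l, gamma)."""
    z = relative_depth(d_r, d_l, gamma)
    n = 1.0 + gamma * (d_r - d_l)
    return ((gamma + z - 2.0 * gamma * d_l * z) / n,
            -(gamma + z + 2.0 * gamma * d_r * z) / n,
            (d_r - d_l - 2.0 * z * (1.0 + d_r * d_l)) / n)


def numeric_sensitivities(d_r, d_l, gamma, h=1e-6):
    """The same quantities estimated by central differences."""
    z = relative_depth(d_r, d_l, gamma)
    args = [d_r, d_l, gamma]
    out = []
    for i in range(3):
        hi, lo = list(args), list(args)
        hi[i] += h
        lo[i] -= h
        out.append((relative_depth(*hi) - relative_depth(*lo)) / (2.0 * h * z))
    return tuple(out)


# Sample operating point: offsets of 5 pixels at .001 rad per pixel, gaze .05 rad.
point = (0.005, -0.005, 0.05)
analytic = analytic_sensitivities(*point)
numeric = numeric_sensitivities(*point)
z_rel = relative_depth(*point)
print(analytic)
print(numeric)
print((z_rel, -z_rel, -2.0 * z_rel))  # leading-order approximations
```

At this operating point the leading-order values Z', -Z' and -2Z' are close to the full expressions, which is the sense in which the simplified forms are used in the discussion that follows.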
We note that related error expressions were obtained
in  [43],  although  the  focus  there  was  on  the  effects  of  er¬ 
rors  in  the  matching  of  image  features  and  the  quantiza¬ 
tion  of  image  pixels  on  the  accuracy  of  recovered  depth. 

Our  concern  is  how  uncertainty  in  measuring  the  cam¬ 
era  parameters  impacts  the  computed  depth.  Ideally,  we 
would  like  a  linear  relationship,  so  that,  for  example,  a 
1  percent  error  in  computing  a  parameter  would  result 
in  at  most  a  1  percent  error  in  depth. 

To  explore  this,  we  consider  two  cases:  a  camera  sys¬ 
tem  with  15mm  focal  length  and  .015mm  pixels  so  that 
a  pixel  subtends  an  angular  arc  of  .001  radians;  and  the 
human  visual  system,  where  the  fovea  has  a  receptor 
packing  subtending  approximately  .00014  radians. 

By equation 8, relative errors in computed depth due
to mismeasurement of the baseline separation are generally
quite small. For example, a 1% relative error in
measuring the baseline will result in a 1% relative error
in the computed distance.


Equations 9 and 10 are essentially the same. They
show a non-linear effect, in that the relative error in computing
depth is a function both of the relative error in
computing the position of each image point with respect
to the global coordinate frame, and more importantly is
a function of the distance of the object from the viewer,
in units of interocular separation ($2b$). Thus, the relative
error will get much worse for more distant objects. If we
let the pixel error in measuring position be $k$, then using
a standard pixel size and focal length, the relative error
in depth is

$$\frac{k}{1000}\, \frac{Z}{2b}$$

for our camera system. To see how large this can get,
we need to understand what can contribute to $k$. Effects
include:

•  image based localization errors

•  image based matching errors

•  registration errors between the image and the world
coordinates due to:

-  principal points

-  image orientation

Uncertainty  and  smoothing  effects  in  the  edge  detec¬ 
tor  will  affect  the  first  source  of  error,  but  typically  will 
only  cause  errors  on  the  order  of  a  few  pixels.  Since 
matching  errors  by  definition  must  lead  to  incorrect 
depth reconstructions, we ignore them in our analysis.
The  second  major  source  of  error  comes  from  convert¬ 
ing  the  image  pixel  measurements  to  world  coordinates, 
and  here  there  are  two  main  sources.  One  is  that  all  of 
our  disparity  measurements  in  the  analysis  above  were 
based  on  the  displacement  of  features  from  the  principal 
points.  This  requires  that  we  measure  those  principal 
points accurately [21], and this is particularly important
since  in  many  cameras,  the  principal  point  can  often  be 
tens  of  pixels  away  from  the  center  of  the  sensor  array. 
For  example,  the  CCD  cameras  in  use  in  one  of  our  stereo 
setups  have  principal  points  displaced  from  the  image  ar¬ 
ray  center  by  30  pixels  in  x  and  1  pixel  in  y  for  the  left 
camera and 18 pixels in x and 3 pixels in y for the right
camera.  Methods  in  the  literature  for  locating  the  prin¬ 
cipal  points  [21]  are  reported  to  have  residual  errors  of 
at  most  6  pixels. 

Finally,  we  need  to  know  the  orientation  of  the  camera 
rasters  with  respect  to  the  world  axes.  Even  if  we  ignore 
the  effects  of  gaze  angle,  rotation  about  the  optic  axis 
(cyclotorsion)  can  result  in  an  error  in  the  disparity  offset 
with  respect  to  the  interocular  baseline.  Since  this  error 
goes  with  the  cosine  of  the  rotation,  we  expect  the  effects 
of  such  error  to  be  small. 

If we have found the principal points and the orientation
of the cameras with respect to world coordinates
accurately, then k will typically be on the order of a few
pixels. If we have not, k can easily be on the order of
tens of pixels. To see the effect of this on reconstructed
depth, Figure 2 shows plots of the percentage relative
error in computing depth, as a function of the distance
to the object (measured in units of interocular separation),
for the case of k = 1 and k = 10.

Figure 2: Vertical axis is the percentage error in computing
depth, horizontal axis is the distance to the object (in
units of interocular separation). Top graph is for errors
in localizing image features of 10 pixels, bottom graph
is for 1 pixel errors.

For an object 1 meter away from our standard camera setup, k = 10
leads to 10% errors in computed depth. For the human
system, these errors are reduced by a factor of 10. A
second way of seeing this is to ask what is the accuracy
on pixel location needed to keep the relative depth error
less than 1%, as a function of the distance to the object.
This is shown in Figure 3.
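The relationship between pixel error and depth error can be tabulated directly. The sketch below assumes the leading-order rule that relative depth error is the angular localization error times the distance in interocular units, and uses the camera described in the text (15mm focal length, .015mm pixels); the function and constant names are illustrative.

```python
PIXEL_ANGLE_CAMERA = 0.015 / 15.0   # radians per pixel for the example camera
PIXEL_ANGLE_FOVEA = 0.00014         # approximate foveal receptor spacing


def depth_error_from_pixels(k, z_rel, pixel_angle=PIXEL_ANGLE_CAMERA):
    """Approximate relative depth error for a localization error of k
    pixels at a distance of z_rel interocular units."""
    return k * pixel_angle * z_rel


def pixel_accuracy_needed(z_rel, max_error=0.01, pixel_angle=PIXEL_ANGLE_CAMERA):
    """Largest pixel error that keeps relative depth error under max_error."""
    return max_error / (pixel_angle * z_rel)


for z_rel in (5, 10, 20, 50):
    print(z_rel,
          depth_error_from_pixels(10, z_rel),
          depth_error_from_pixels(1, z_rel),
          pixel_accuracy_needed(z_rel))
```

For example, at 10 interocular units a 10 pixel error gives a relative depth error of about 0.1, and holding the error to 1% at that distance requires localization to about one pixel.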

By  equation  11,  a  1  percent  error  in  estimating  /  and 
disparities  on  the  order  of  10  pixels,  will  still  only  lead 
to  1  percent  errors  in  relative  depth  for  nearby  objects 
(Z/2b ≈ 10), which is small. Note that as the disparities
get larger, the error increases. This has the interesting
implication  that  if  the  object  of  interest  is  roughly  fix¬ 
ated  (i.e.  the  two  optic  axes  intersect  at  or  near  the 
object)  then  disparities  for  features  on  the  objects  will 
be small, and the depth error will be small, while objects
at  larger  disparities  will  have  larger  errors.  Note  that  a 
similar  observation  has  been  made  by  Olson  [31]  who 
shows  that  much  of  the  sensitivity  of  depth  reconstruc¬ 
tion  to  camera  parameters  can  be  isolated  in  the  compu¬ 
tation  of  the  depth  of  the  fixation  point,  while  relative 
depth  of  other  points  with  respect  to  this  fixation  can  be 
computed  fairly  accurately. 

All of this analysis is encouraging. Consider equation
12, however. Here, a 1 degree error in estimating the
gaze angle will lead to 34 percent relative depth errors
for nearby objects (Z/2b ≈ 10), and even a .5 degree gaze
angle error will lead to 17 percent relative depth errors.
This is graphed in more detail in Figure 4. Similarly, in
Figure 5, we plot the accuracy in gaze angle needed to
keep the relative depth error at most 1%, as a function
of distance to the object.

Figure 3: Vertical axis is the accuracy in pixel location
needed so that the relative error in depth is less than 1%,
horizontal axis is the distance to the object (in units of
interocular separation).

We note that errors due to gaze angle calibration
could be a real problem. It is interesting to note that
the human system appears able to measure gaze angle
only up to an accuracy of roughly 1 degree [16] (page
67).
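The gaze-angle figures can be reproduced with a few lines. The sketch below assumes the leading-order rule that the magnitude of the relative depth error is twice the distance in interocular units times the gaze error in radians; the function names are illustrative.

```python
import math


def depth_error_from_gaze(gaze_error_degrees, z_rel):
    """Approximate magnitude of relative depth error for a gaze-angle
    error, at a distance of z_rel interocular units."""
    return 2.0 * z_rel * math.radians(gaze_error_degrees)


def gaze_accuracy_needed_degrees(z_rel, max_error=0.01):
    """Gaze accuracy (degrees) that keeps relative depth error under max_error."""
    return math.degrees(max_error / (2.0 * z_rel))


for z_rel in (5, 10, 20):
    print(z_rel,
          depth_error_from_gaze(1.0, z_rel),
          depth_error_from_gaze(0.5, z_rel),
          gaze_accuracy_needed_degrees(z_rel))
```

At 10 interocular units this gives roughly 0.35 and 0.17 for 1 degree and .5 degree errors, in line with the percentages quoted above, and shows that holding the error to 1% at that distance would require gaze accuracy of a few hundredths of a degree.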

In  short,  we  need  to  be  certain  that  we  have  estimated 
the principal points accurately, and that we have very
accurate measurements of the gaze angles of the cameras.
If we cannot do so, then we will suffer distortion in
our  computed  depth.  More  importantly,  that  distortion 
varies  with  actual  depth,  so  the  effect  is  non-linear.  If  we 
are  trying  to  recognize  an  object  whose  extent  in  depth 
is  small  relative  to  the  distance  to  its  centroid,  then  the 
effect  of  this  noise  sensitivity  is  reduced.  This  is  because 
the  effect  of  the  error  will  be  systematic,  and  in  the 
case  of  small  relative  depth,  this  uncertainty  basically 
becomes  a  constant  scale  factor  on  the  computed  depth. 
On the other hand, however, if the object has noticeable
relative extent in depth (even on the order of a few
percent), then the uncertainty in computing depth will
skew  the  results,  causing  difficulties  for  most  recognition 
methods  that  compare  computed  3D  structure  against 
stored models. Thus, the sensitivity may cause serious
problems for recognition methods, both due to the large
errors in depth and due to the distortions with varying
depth. 

4  Another  Look  at  Stereo 

Figure 4: Vertical axis is the percentage error in computing
depth, horizontal axis is the distance to the object
(in units of interocular separation). Graphs are for errors
in computing the gaze angle of 1, .5 and .‘i-l degrees,
from top to bottom.

Figure 5: Vertical axis is the accuracy in gaze angle (in
degrees) needed so that the relative error in depth is less
than 1%, horizontal axis is the distance to the object (in
units of interocular separation).

Given that it may be difficult to reliably compute distance,
and that distance may not be needed to handle
the two main uses of stereo output, we suggest that
it is useful to reconsider the performance requirements
that stereo should satisfy to support tasks such as object
recognition. To handle figure/ground separation, a
stereo algorithm should:

•  be able to detect proximal (in the image) features
that lie within some range of depth (i.e. find points
that are near one another in 3D space, even if one
does not know exactly where in 3D),

•  be able to align matching distinctive features so
that they are centered in the two images, to ensure
that nearby parts of the corresponding object are
visible in both images and can be matched,

•  be able to integrate other visual cues about possible
trigger features to foveate and fixate.

First, we should consider whether we can use existing
stereo algorithms (e.g. [10], [4], [2()], [30], [14]) to
tackle  the  problem  of  figure/ground  separation.  We  can 
conveniently  separate  stereo  processing  into  several  com¬ 
ponents: 

•  Choice  of  features  to  match:  for  our  discussions, 
we will consider only edge based stereo matching.

•  Constraints  on  the  matching  process. 

•  Control  mechanism  used  to  guide  the  matching 
process. 

Most current stereo algorithms solve the correspondence
problem as follows: Given any left image edge,
search the set of right image edges for a unique match.
The search is usually constrained by the (assumed
known)  epipolar  geometry,  and  by  a  set  of  similarity 
constraints  (e.g.  edges  should  have  similar  orientation, 
similar  contrast  (or  intensity  variation),  and  so  on).  This 
holds  both  for  matching  individual  edge  points  (in  which 
case  additional  constraints  such  as  figural  continuity  may 
also  apply)  and  for  extended  edge  fragments. 

The  key  question  is  what  constitutes  a  unique  match, 
and  this  depends  on  the  control  mechanism  used  by  the 
algorithm.  For  example,  most  of  these  algorithms  at¬ 
tempt  to  find  matches  over  a  wide  range  of  disparity, 
reflecting  the  fact  that  the  viewed  scenes  may  have  ob¬ 
jects  ranging  from  close  to  the  viewer  (less  than  1  meter) 
out  to  objects  at  the  horizon.  This  can  easily  translate 
into  disparity  ranges  on  the  order  of  several  hundred  pix¬ 
els.  The  problem  is  that  under  these  circumstances,  it 
may  be  very  difficult  to  guarantee  uniqueness  of  match, 
especially  when  one  is  only  considering  local  attributes 
of  features,  such  as  orientation  and  local  contrast.  One 
solution  is  to  incorporate  local  geometric  information 
about  nearby  edges  [3],  [29].  But  an  alternative  is  to 
consider  changing  the  control  mechanism. 

The  key  problem  is  that  previous  stereo  algorithms 
had  as  their  goal  the  reconstruction  of  the  scene,  and 
hence  they  were  designed  to  find  as  many  correct 
matches  as  possible,  over  a  wide  range  of  disparities. 
On the other hand, if all we are interested in is separating
out candidate image features that are likely to
correspond to a single object, and we are willing to allow
edge features to participate in several such groups,
then an alternative control method is viable. In particular,
since we are interested in finding roughly contiguous
3D regions, it is attractive to envision a control method
in which one fixates at some target, then searches for
matching features within some range of disparity about
that fixation point, collecting all such matching features
as a candidate object, and continues.

Such an algorithm is similar in approach to some earlier
stereo methods, notably [23, 27, 3], and it bears
some similarity to evidence about the human stereo system,
in particular in the restriction of matching disparities only
over a narrow range about the fixation point (referred
to as Panum's limit in the perceptual literature) and the
role of eye movements in guiding stereo [23, 27, 31]. It
also clearly relates to work in active stereo head systems
[1, 5, 6, 7, 9, 20, 30, 38, 33], especially work on using
saliency of low level cues, or using motion information
to drive stereo control loops that fixate candidate target
areas [9, 6, Ij, 30, 38, 33].

To  demonstrate  this  idea,  we  have  implemented  the 
following  stereo  algorithm  (influenced  in  part  by  earlier 
algorithms [3], [29]).

• Process both images to extract intensity edges. For
convenience, process these edges to extract linear
segments, using a standard split-and-merge algorithm.
This latter step is mainly for reduction in
computation and is not central to the demonstration.

•  For  each  linear  feature  segment,  record  the  position 
of  the  two  endpoints,  and  the  average  intensity  on 
each  side  of  the  feature.  Also  record  the  distance 
from  each  endpoint  to  other  nearby  features. 

•  Find  a  distinctive  feature  in  one  image  that  has  a 
unique  match  in  the  other  image,  as  measured  over 
the  full  range  of  possible  disparities.  To  begin  with, 
we  will  measure  distinctiveness  as  a  combination  of 
the  length  of  the  feature  and  the  contrast  of  the 
feature. The idea is that such a feature can serve
as a focal trigger feature. Of course, many other
cues could serve to focus attention [22].

•  Rotate  both  cameras  so  that  the  distinct  feature 
and  its  match  are  both  centered  in  the  cameras. 
This  is  a  simple  version  of  a  fixation  mechanism,  in 
which  the  trigger  feature  is  foveated  and  fixated  in 
both  cameras.  Note  that  this  will  in  general  cause 
the  optic  axes  to  be  non-parallel  so  that  epipolar 
lines will no longer lie along horizontal rasters. A
simpler  version  just  uses  a  pan  and  tilt  motion  of 
the  cameras  to  center  the  feature  in  one  image, 
while  leaving  the  optic  axes  parallel. 

• Within a predefined range of disparity ±δ
(Panum's limit) about the zero disparity position
(due  to  fixation),  search  for  other  features  that  have 
a  unique  match.  Note  that  uniqueness  here  means 
only  within  this  range  of  disparity.  There  may  be 
other  edges  outside  of  this  disparity  range  that  sat¬ 
isfy  the  matching  constraints,  but  in  this  case  such 
matches are ignored. In our implementation, two
edges match if their lengths are roughly the same,
if a significant fraction of each edge has an epipolar
overlap with the other edge, if the orientation
is roughly the same, if the average intensity on at
least one side of the edge is roughly the same, and
if the arrangement of neighbouring edges at one of
the endpoints is roughly the same.

• This set of edges now constitutes a hypothesized
fragment of a single object. We can save these
edges, and continue the process, looking for another
unique trigger feature to align the cameras.
Alternatively, we can pass these edge features on to
a recognition algorithm, such as Alignment [17, 18].
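
The steps above can be sketched in code. The following is a minimal Python sketch under stated assumptions, not our implementation: the `Segment` class, the function names and every tolerance value are hypothetical. Disparity is taken as the difference in midpoint x-coordinate (left minus right), and epipolar lines are assumed horizontal, as in the simpler parallel-axis version of the fixation step. Fixation is simulated by treating the trigger's disparity as the zero reference rather than by moving cameras. The test on the arrangement of neighbouring edges is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    x1: float
    y1: float
    x2: float
    y2: float
    side_a: float  # average intensity on one side of the edge
    side_b: float  # average intensity on the other side

    @property
    def length(self):
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    @property
    def angle(self):
        # Undirected orientation in [0, pi).
        return math.atan2(self.y2 - self.y1, self.x2 - self.x1) % math.pi

    @property
    def mid_x(self):
        return 0.5 * (self.x1 + self.x2)

    @property
    def contrast(self):
        return abs(self.side_a - self.side_b)


def similar(a, b, len_tol=0.2, ang_tol=0.2, int_tol=15.0, min_overlap=0.5):
    """Similarity constraints; all tolerances are assumed values."""
    if abs(a.length - b.length) > len_tol * max(a.length, b.length):
        return False
    d_ang = abs(a.angle - b.angle)
    if min(d_ang, math.pi - d_ang) > ang_tol:
        return False
    # Intensity must roughly agree on at least one side of the edge.
    if abs(a.side_a - b.side_a) > int_tol and abs(a.side_b - b.side_b) > int_tol:
        return False
    # Epipolar overlap: shared image rows, assuming horizontal epipolar lines.
    lo = max(min(a.y1, a.y2), min(b.y1, b.y2))
    hi = min(max(a.y1, a.y2), max(b.y1, b.y2))
    span = max(abs(a.y2 - a.y1), abs(b.y2 - b.y1))
    return (hi - lo) >= min_overlap * span


def unique_match(seg, candidates, center, delta):
    """Return the only similar candidate with disparity in center +/- delta."""
    hits = [c for c in candidates
            if abs((seg.mid_x - c.mid_x) - center) <= delta and similar(seg, c)]
    return hits[0] if len(hits) == 1 else None


def group_about_fixation(left, right, full_range=300.0, delta=10.0):
    """One fixation cycle: pick a trigger, then gather edges near its depth."""
    # Trigger: most distinctive left edge (length times contrast) that has a
    # unique match over the full disparity range.
    for trig in sorted(left, key=lambda s: s.length * s.contrast, reverse=True):
        mate = unique_match(trig, right, 0.0, full_range)
        if mate is not None:
            break
    else:
        return []
    # The trigger's disparity plays the role of zero disparity after fixation.
    fixation = trig.mid_x - mate.mid_x
    group = []
    for seg in left:
        m = unique_match(seg, right, fixation, delta)
        if m is not None:
            group.append((seg, m))
    return group
```

Calling `group_about_fixation(left_segments, right_segments)` returns the matched pairs that lie within the narrow disparity band about the trigger. An edge whose only similar partner lies outside that band is left out of the group, which is the intended behaviour: uniqueness is judged only within the band.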

We have implemented an initial version of this algorithm,
and used it in conjunction with an eye-head system,
which can pan and tilt as a unit, as well as change
the optic axes of one or both cameras. An example of
this algorithm in operation is shown in Figures 6-11.
Given the images in Figure 6, we extract edges (Figure
7). From this set of edges, the most distinctive edge
(measured as a combination of length and intensity contrast)
with a unique match is isolated in Figure 8. This
enables the cameras to fixate the edge and obtain a new
set of images (Figure 9) and edges (Figure 10). Relative
to this fixation, stereo matching is performed over a narrow
range of disparity, isolating a set of edges likely to
come from a single object (Figure 11). Notice how the
tripod is extracted from the originally cluttered image,
with minimal additional features.

5  Conclusions 

We have suggested that stereo may play a central role
in object recognition, but not in the manner usually assumed
in the literature. We have suggested that stereo
may be most useful in supporting figure/ground separation,
and that to do so it need not compute explicit 3D
information. Supporting this argument were the observation
that depth reconstruction is extremely sensitive
to accuracy in the measured camera parameters, and the
observation that the human stereo system may not compute
explicit depth.

The idea of using depth detectors tuned to a narrow
range about a fixation point has been previously
explored in the literature, primarily for obstacle avoidance
[15], [32]. This work considers the same general
idea within the context of recognition. Such an approach
opens up several other avenues for investigation. For
example, what is the role of other visual cues in aiding the
stereo matching problem? While one option is to augment
image features with attributes, such as texture or
color measures, an alternative is to consider using such
cues to drive vergence eye movements, helping to align
the cameras on trigger features, so that the local matcher
can extract image features likely to correspond to a single
object. We intend to explore these and related issues
in the near future.